Kristian K Ullrich
Max Planck Institute for Evolutionary Biology
A matter of distance.
Synonymous and non-synonymous substitutions: nucleotide substitutions might lead to changes on amino acid level.
Typically used to:
An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.
MSA2dist Features:
From Bioconductor:
From GitHub:
Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.
Source: Pico-PLAZA 3.0
Where to find: the data/ directory in the GitHub repo associated with this presentation.
data
├── annotation.rda
├── cds.rda
├── clusters.rda
├── data_acquisition.Rmd
├── diamond_list.rda
├── profiles.rda
└── proteomes.rda
cds.rda: DNAStringSet object.clusters.rda: Pre-calculated syntenet clusters.cds# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(tidyr)
library(dplyr)
library(stringr)
# Load data set cds
load(here::here("data", "cds.rda"))
# Inspect data
head(names(cds))[1] "CNC64A_028G00030" "CNC64A_028G00040" "CNC64A_028G00070" "CNC64A_028G00240"
[5] "CNC64A_028G00150" "CNC64A_028G00170"
clusters Gene Cluster
1 Chlo_CNC64A_028G00030 1
2 Chlo_CNC64A_028G00040 2
3 Chlo_CNC64A_028G00070 3
4 Chlo_CNC64A_028G00080 4
5 Chlo_CNC64A_028G00110 5
6 Chlo_CNC64A_028G00140 6
[1] 22605
1 2 3 4 5 6
38 38 37 2 9 4
# Get first cluster of size 10 (fc10)
fc10 <- which(table(clusters$Cluster)==10)[1]
my_cluster <- clusters %>%
dplyr::filter(Cluster==fc10)
head(my_cluster) Gene Cluster
1 Oluc_OL13G02120 119
2 Bpra_BPRRCC1105_04G02530 119
3 Osp_ORCC809_13G00870 119
4 Oluc_OL21G00860 119
5 Omed_OM_08G02190 119
6 Msp_MRCC299_06G03520 119
# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]
head(my_cluster) Gene Cluster GeneID
1 Oluc_OL13G02120 119 OL13G02120
2 Bpra_BPRRCC1105_04G02530 119 BPRRCC1105_04G02530
3 Osp_ORCC809_13G00870 119 ORCC809_13G00870
4 Oluc_OL21G00860 119 OL21G00860
5 Omed_OM_08G02190 119 OM_08G02190
6 Msp_MRCC299_06G03520 119 MRCC299_06G03520
DNAStringSet object of length 6:
width seq names
[1] 1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL13G02120
[2] 1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530
[3] 1605 ATGTCGCTCAAATCGATCAACCC...GCATCAACATGCGAAAGAAATAG ORCC809_13G00870
[4] 1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL21G00860
[5] 1605 ATGAGTTTGAAATCCGTCAACCC...GGGTGAACATGCGCAAACGATAA OM_08G02190
[6] 1707 ATGCAGGGCATGCAAAGCCCCAT...CTCCCGAAAAACTGGGGGAGTAA MRCC299_06G03520
To calculate a coding sequence alignment for two sequences:
# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_alnDNAStringSet object of length 2:
width seq names
[1] 1623 ATGTCG------CTCAAGTCCAT...GCAAGCGA------------TGA OL13G02120
[2] 1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530
To calculate dN/dS from this alignment:
# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks Comp1 Comp2 seq1 seq2 ka ks vka
1 1 2 OL13G02120 BPRRCC1105_04G02530 0.1640926 1.849032 0.0003232055
vks
1 0.4047548
Other Biostrings::pairwiseAlignment options can be set:
type = "global", substitutionMatrix = "BLOSUM62"
gapOpening = 10, gapExtension = 0.5
To calculate all pairwise combinations:
start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()
head(my_cds_kaks, n = 4) Comp1 Comp2 seq1 seq2 Method Ka Ks
result.1 1 2 OL13G02120 BPRRCC1105_04G02530 YN 0.145529 4.33579
result.2 1 3 OL13G02120 ORCC809_13G00870 YN 0.0328853 2.73769
result.3 1 4 OL13G02120 OL21G00860 YN NA NA
result.4 1 5 OL13G02120 OM_08G02190 YN 0.0570134 4.34114
Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4)
result.1 0.0335645 3.77014e-136 1602 324.101 1277.9 NA
result.2 0.0120121 3.37981e-125 1602 313.774 1288.23 NA
result.3 NA NA 1602 313.942 1288.06 NA
result.4 0.0131333 1.79968e-161 1602 326.422 1275.58 NA
Substitutions S-Substitutions N-Substitutions
result.1 443 274.184 168.816
result.2 238 196.59 41.4101
result.3 0 NA NA
result.4 326 255.985 70.0148
Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1 NA NA
result.2 NA NA
result.3 NA NA
result.4 NA NA
Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1 0.993263 1.65453:1.65453:1:1:1:1
result.2 0.562657 2.66376:2.66376:1:1:1:1
result.3 NA 2:2:1:1:1:1
result.4 0.929943 1.39012:1.39012:1:1:1:1
GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.537893(0.536969:0.358595:0.718115) NA NA NA NA
result.2 0.561682(0.562617:0.349533:0.772897) NA NA NA NA
result.3 0.561371(0.568224:0.35514:0.760748) NA NA NA NA
result.4 0.542368(0.561682:0.35514:0.71028) NA NA NA NA
How long did it take?
ka_mat <- my_cds_kaks %>%
dplyr::select(seq1, seq2, Ka) %>%
dplyr::mutate(Ka = as.numeric(Ka)) %>%
tidyr::pivot_wider(names_from = "seq1", values_from = Ka)
head(ka_mat)# A tibble: 6 × 10
seq2 OL13G02120 BPRRCC1105_04G0… ORCC809_13G00870 OL21G00860 OM_08G02190
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BPRRCC110… 0.146 NA NA NA NA
2 ORCC809_1… 0.0329 0.151 NA NA NA
3 OL21G00860 NA 0.146 0.0329 NA NA
4 OM_08G021… 0.0570 0.146 0.0719 0.0570 NA
5 MRCC299_0… 0.673 0.699 0.653 0.673 0.715
6 MP09G03570 0.666 0.696 0.646 0.666 0.678
# … with 4 more variables: MRCC299_06G03520 <dbl>, MP09G03570 <dbl>,
# MP09G02740 <dbl>, OT_13G02270 <dbl>
# Extract sequence names
seq_names <- c(colnames(ka_mat)[-1], rev(ka_mat$seq2)[1])
# Add NA to fill into square distance matrix
ka_mat <- NA |> rbind(ka_mat[, -1]) |> cbind(NA) |>
tidyr::as_tibble()
# Set column names
colnames(ka_mat) <- seq_names
head(ka_mat)# A tibble: 6 × 10
OL13G02120 BPRRCC1105_04G02530 ORCC809_13G00870 OL21G00860 OM_08G02190
<dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA
2 0.146 NA NA NA NA
3 0.0329 0.151 NA NA NA
4 NA 0.146 0.0329 NA NA
5 0.0570 0.146 0.0719 0.0570 NA
6 0.673 0.699 0.653 0.673 0.715
# … with 5 more variables: MRCC299_06G03520 <dbl>, MP09G03570 <dbl>,
# MP09G02740 <dbl>, OT_13G02270 <dbl>, MRCC299_06G02640 <lgl>
ape::bionjs reconstructs a phylogenetic tree from a distance matrix with possibly missing values:
Here, a pre-calculated MSA is necessary:
Plot average behavior of each codon:
Kristian K Ullrich @kullrich